Record Matching in Digital
Abstract
When data stores grow large, data quality, cleaning, and integrity become issues. The commercial sector spends a massive amount of time and energy canonicalizing customer and product records as its lists of products and consumers expand. An Accenture study in 2006 found that a high-tech equipment manufacturer saved $6 million per year by removing redundant customer records used in customer mailings. In 2000, the U.K. Ministry of Defence embarked on the massive “The Cleansing Project,” solving key problems with its inventory and logistics and saving over $25 million over four years.

In digital libraries, such problems manifest most urgently not in the customer, product, or item records, but in the metadata that describes the library’s holdings. Several well-known citation lists of computer science research contain over 50% duplicate citations, although none of these duplicates are exact string matches [2]. Without metadata cleaning, libraries might end up listing multiple records for the same item, causing circulation problems and skewing the distribution of their holdings. In addition, when different authors share the same name (for example, Wei Wang, J. Brown), author disambiguation must be performed to link authors to their own monographs and articles, and not to those of others.

Metadata inconsistencies can arise from varying field order, different delimiters, omitted fields, multiple representations of the names of people and organizations, and typographical errors. When libraries import large volumes of metadata from sources that follow a metadata standard, a manually compiled set of rules called a crosswalk may be used to transform the metadata into the library’s own format. However, such crosswalks are expensive to create manually, and public ones exist only for a few well-used formats. Crucially, they also do not address how to detect and remove inexact duplicates.
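To make the crosswalk idea concrete, here is a minimal Python sketch. The mapping table, the target field names, and the `apply_crosswalk` helper are hypothetical and are not taken from the article; the source keys only borrow the familiar Dublin Core element names for illustration.

```python
# Hypothetical crosswalk: source field names -> the library's own field names.
# Real crosswalks are far larger and often include value-level rules as well.
CROSSWALK = {
    "dc:title": "title",
    "dc:creator": "author",
    "dc:date": "year",
}


def apply_crosswalk(record, crosswalk=CROSSWALK):
    """Rename the fields of one metadata record according to the crosswalk.

    Fields without a mapping are dropped in this sketch.
    """
    return {crosswalk[key]: value for key, value in record.items() if key in crosswalk}


imported = {"dc:title": "Record Matching", "dc:creator": "J. Brown", "publisher": "n/a"}
converted = apply_crosswalk(imported)
```

A field-renaming table like this only changes the shape of each record. It does nothing to detect two differently written records that describe the same item.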
As digital libraries mine and incorporate data from a wider variety of sources, especially noisy sources such as the Web, finding a suitable and scalable matching solution becomes critical. Here, we examine this problem and its solutions. The de-duplication task takes a list of metadata records as input and returns the list with duplicate records removed. For example, the search results shown in the figure here are identical and should have been combined into a single entry. It should be noted that many disciplines of computer science have instances of similar inexact match...
Technical Opinion
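The de-duplication task described above (a list of records in, the same list with duplicates removed out) can be sketched in Python as follows. This is only an illustration under stated assumptions, not the method the article examines: records are treated as plain citation strings, similarity is measured with the standard library's `difflib.SequenceMatcher`, and the `normalize` and `deduplicate` helpers and the 0.9 threshold are all hypothetical choices.

```python
import re
from difflib import SequenceMatcher


def normalize(record):
    """Lowercase, replace punctuation with spaces, and collapse whitespace."""
    text = re.sub(r"[^\w\s]", " ", record.lower())
    return " ".join(text.split())


def deduplicate(records, threshold=0.9):
    """Return records with inexact duplicates removed, keeping first occurrences.

    Two records count as duplicates when the similarity ratio of their
    normalized forms reaches the (assumed) threshold.
    """
    kept = []       # original strings retained in the output
    kept_keys = []  # normalized forms of the retained strings
    for record in records:
        key = normalize(record)
        is_duplicate = any(
            SequenceMatcher(None, key, other).ratio() >= threshold
            for other in kept_keys
        )
        if not is_duplicate:
            kept.append(record)
            kept_keys.append(key)
    return kept


citations = [
    "Smith, J. Record matching. 2008.",
    "smith j  record macthing 2008",  # same item, different delimiters and a typo
    "Totally different title by another author, 1999",
]
unique = deduplicate(citations)
```

This sketch compares every record against every retained record, so its cost grows quadratically with the size of the list. It shows the shape of the task, not a scalable solution.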
Similar resources
Adaptive Approximate Record Matching
Typographical data entry errors and incomplete documents produce imperfect records in real-world databases. These errors generate distinct records that belong to the same entity. The aim of Approximate Record Matching is to find the multiple records that belong to one entity. In this paper, an algorithm for Approximate Record Matching is proposed that can be adapted automatically with input error...
Color scene transform between images using Rosenfeld-Kak histogram matching method
In digital color imaging, it is of interest to transfer the color scene of one image to another. Some attempts have been made at this using, for example, lαβ color space, principal component analysis, and, recently, the histogram rescaling method. In this research, a novel method is proposed based on the Rosenfeld and Kak histogram matching algorithm. It is suggested that to transform the color...
Approximate String Comparison and its Effect on an Advanced Record Linkage System
Record linkage, sometimes referred to as information retrieval (Frakes and Baeza-Yates, 1992), is needed for the creation, unduplication, and maintenance of name and address lists. This paper describes string comparators and their effect in a production matching system. Because many lists have typographical errors in more than 20 percent of first names and also in last names, effective methods f...
The role of medical records in implementing the EFQM model in hospitals
Background and Aim: Information is a factor in the success of organizations as they try to survive in a competitive world. In each organization, some sections have a special role in handling information; in hospitals and healthcare centers, this role belongs to the medical records section, which organizes all of the patients' health care information. Paying attention to function quality in this ...
Optimizing image steganography by combining the GA and ICA
In this study, a novel approach is proposed that combines steganography and cryptography to hide information in digital images used as host media. In the process, secret data is first encrypted using the mono-alphabetic substitution cipher method, and then the encrypted secret data is embedded inside an image using an algorithm which combines the random patterns based on Space Fillin...
Evaluation of Similarity Measures for Template Matching
Image matching is a critical process in various photogrammetry, computer vision and remote sensing applications such as image registration, 3D model reconstruction, change detection, image fusion, pattern recognition, autonomous navigation, and digital elevation model (DEM) generation and orientation. The primary goal of the image matching process is to establish the correspondence between two ...
Publication date: 2008